Lab 2 - Aprendizaje no supervisado

Clustering

Preprocesamiento

s

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Warning: package 'clustertend' was built under R version 4.3.3
## Package `clustertend` is deprecated.  Use package `hopkins` instead.
## Warning: package 'NbClust' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## ---------------------
## Welcome to dendextend version 1.19.1
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## 
## Attaching package: 'dendextend'
## 
## The following object is masked from 'package:stats':
## 
##     cutree
## 'data.frame':    19883 obs. of  28 variables:
##  $ id                       : int  1627085 1626914 1626898 1626808 1626678 1626234 1626010 1625551 1625043 1624457 ...
##  $ budget                   : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ genres                   : chr  "Drama|Crime" "Animation" "Animation" "Thriller|Mystery|Documentary" ...
##  $ homePage                 : chr  "" "" "" "" ...
##  $ productionCompany        : chr  "" "" "" "" ...
##  $ productionCompanyCountry : chr  "" "" "" "" ...
##  $ productionCountry        : chr  "" "" "" "" ...
##  $ revenue                  : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ runtime                  : int  95 3 2 5 12 14 39 90 96 106 ...
##  $ video                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ director                 : chr  "Javad Hakami" "Kimmy Gatewood" "Kimmy Gatewood" "Felipe Roldán" ...
##  $ actors                   : chr  "Mohsen Ghasabian|Aida Mahiani|Mehran Ghafourian|Payam Ahmadinia|Masoud Karamati|Roya Javidnia|Nasim Adabi|Siavash Cheraghipour" "Kameron Jackson|Laura Weaving|sara weaving|Bertha Williams" "Cedric Mitchell|Cajun mills|Laura Williams" "Tomás Tuchsznajder|Matias Junas|Martin Etcheverry|Romeo Jeirfimczuk|Agustin Pulido|Alec Drach|Franco Serio" ...
##  $ actorsPopularity         : chr  "0.3453|0.1664|0.9684|0.3437|0.3713|0.2437|0.2796|0.2639" "0|0.0071|0|0" "0.0193|0|0.0143" "0|0|0|0|0|0|0" ...
##  $ actorsCharacter          : chr  "|||||||" "Prince Charming|Evil Stepmother|Fairy Godmother|Cinderella" "Aladdin|Jafar|Jasmine" "||||||" ...
##  $ originalTitle            : chr  "غوطه ور" "Cinderella" "Aladdin" "EL ANILLO Y EL DECK" ...
##  $ title                    : chr  "Immersed" "Cinderella" "Aladdin" "THE RING AND THE DECK" ...
##  $ originalLanguage         : chr  "fa" "en" "en" "es" ...
##  $ popularity               : num  0.0357 0.0357 0.0214 0.0429 0.0379 ...
##  $ releaseDate              : chr  "2026-02-01" "2026-02-01" "2026-02-01" "2026-02-01" ...
##  $ voteAvg                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ voteCount                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ genresAmount             : int  2 1 1 3 1 1 1 1 3 1 ...
##  $ productionCoAmount       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ productionCountriesAmount: int  0 0 0 0 0 0 0 1 1 0 ...
##  $ actorsAmount             : int  8 4 3 7 3 3 5 4 5 5 ...
##  $ castWomenAmount          : int  2 0 0 0 0 0 0 3 1 2 ...
##  $ castMenAmount            : int  5 0 0 0 0 0 3 0 3 3 ...
##  $ releaseYear              : int  2026 2026 2026 2026 2026 2026 2026 2026 2026 2026 ...
##        id              budget             genres            homePage        
##  Min.   :      5   Min.   :        0   Length:19883       Length:19883      
##  1st Qu.: 146220   1st Qu.:        0   Class :character   Class :character  
##  Median : 869623   Median :        0   Mode  :character   Mode  :character  
##  Mean   : 902240   Mean   :  9413280                                        
##  3rd Qu.:1589602   3rd Qu.:  1000000                                        
##  Max.   :1627166   Max.   :380000000                                        
##                                                                             
##  productionCompany  productionCompanyCountry productionCountry 
##  Length:19883       Length:19883             Length:19883      
##  Class :character   Class :character         Class :character  
##  Mode  :character   Mode  :character         Mode  :character  
##                                                                
##                                                                
##                                                                
##                                                                
##     revenue             runtime         video           director        
##  Min.   :0.000e+00   Min.   :  0.00   Mode :logical   Length:19883      
##  1st Qu.:0.000e+00   1st Qu.: 10.00   FALSE:19313     Class :character  
##  Median :0.000e+00   Median : 86.00   TRUE :84        Mode  :character  
##  Mean   :2.879e+07   Mean   : 66.09   NA's :486                         
##  3rd Qu.:3.306e+05   3rd Qu.:103.00                                     
##  Max.   :2.847e+09   Max.   :750.00                                     
##                                                                         
##     actors          actorsPopularity   actorsCharacter    originalTitle     
##  Length:19883       Length:19883       Length:19883       Length:19883      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     title           originalLanguage     popularity        releaseDate       
##  Length:19883       Length:19883       Min.   :    0.000   Length:19883      
##  Class :character   Class :character   1st Qu.:    0.055   Class :character  
##  Mode  :character   Mode  :character   Median :    8.502   Mode  :character  
##                                        Mean   :   26.249                     
##                                        3rd Qu.:   22.239                     
##                                        Max.   :11474.647                     
##                                                                              
##     voteAvg         voteCount        genresAmount    productionCoAmount
##  Min.   : 0.000   Min.   :    0.0   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.: 0.000   1st Qu.:    0.0   1st Qu.: 1.000   1st Qu.: 0.000    
##  Median : 5.400   Median :    6.0   Median : 2.000   Median : 1.000    
##  Mean   : 3.837   Mean   :  675.9   Mean   : 1.949   Mean   : 1.973    
##  3rd Qu.: 6.800   3rd Qu.:  423.0   3rd Qu.: 3.000   3rd Qu.: 3.000    
##  Max.   :10.000   Max.   :30788.0   Max.   :16.000   Max.   :89.000    
##                                                                        
##  productionCountriesAmount  actorsAmount    castWomenAmount  castMenAmount   
##  Min.   :  0.00            Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  1.00            1st Qu.:     3   1st Qu.:     0   1st Qu.:     0  
##  Median :  1.00            Median :     9   Median :     2   Median :     3  
##  Mean   :  1.23            Mean   :  1082   Mean   :  3516   Mean   :  8224  
##  3rd Qu.:  1.00            3rd Qu.:    21   3rd Qu.:     6   3rd Qu.:    12  
##  Max.   :155.00            Max.   :919590   Max.   :922162   Max.   :922017  
##                                             NA's   :37       NA's   :162     
##   releaseYear  
##  Min.   :1902  
##  1st Qu.:2013  
##  Median :2021  
##  Mean   :2017  
##  3rd Qu.:2025  
##  Max.   :2026  
##  NA's   :2

Selección de variables

Eliminamos variables que no aportan a la formación de grupos:

id, originalTitle, title, homePage, actorsCharacter, releaseDate, director, productionCompany, actors

Variables textuales no numéricas

Trabajaremos con variables numéricas relevantes para comportamiento y desempeño:

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(everything(), ~as.numeric(as.character(.)))`.
## Caused by warning:
## ! NAs introduced by coercion

Tendencia al Agrupamiento

Estadistico de Hopkins

set.seed(123)
hopkins_stat <- hopkins(movies_scaled, n = nrow(movies_scaled)-1)
## Warning: Package `clustertend` is deprecated.  Use package `hopkins` instead.
hopkins_stat
## $H
## [1] 0.02059509

VAT

fviz_dist(dist(movies_scaled))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Numero Optimo de Clusters

Metodo del Codo

fviz_nbclust(movies_scaled, kmeans, method = "wss") +
  ggtitle("Método del Codo")

Metodo de silueta

fviz_nbclust(movies_scaled, kmeans, method = "silhouette") +
  ggtitle("Método de Silueta")

1.4 Aplicación de K-medias y Clustering Jerárquico

K-medias

set.seed(123)

k_opt <- 3  # Ajustar según los métodos del codo y silueta

kmeans_model <- kmeans(movies_scaled, centers = k_opt, nstart = 25)

# Agregar cluster al dataset
movies_clean$cluster_kmeans <- as.factor(kmeans_model$cluster)

# Visualización
fviz_cluster(kmeans_model, data = movies_scaled,
             ellipse.type = "convex",
             ggtheme = theme_minimal(),
             main = "Clusters con K-means")

# Matriz de distancias
dist_matrix <- dist(movies_scaled)

# Método de Ward
hc_model <- hclust(dist_matrix, method = "ward.D2")

# Dendrograma
plot(hc_model, labels = FALSE, main = "Dendrograma - Clustering Jerárquico")

# Cortar el árbol en k clusters
hc_clusters <- cutree(hc_model, k = k_opt)

movies_clean$cluster_hc <- as.factor(hc_clusters)

# Visualización
fviz_cluster(list(data = movies_scaled, cluster = hc_clusters),
             ellipse.type = "convex",
             ggtheme = theme_minimal(),
             main = "Clusters Jerárquicos")

table(Kmeans = movies_clean$cluster_kmeans,
      Jerarquico = movies_clean$cluster_hc)
##       Jerarquico
## Kmeans    1    2    3
##      1    3   30    0
##      2    0    0    6
##      3 1033    3    0
sil_kmeans <- silhouette(kmeans_model$cluster, dist_matrix)
fviz_silhouette(sil_kmeans)
##   cluster size ave.sil.width
## 1       1   33         -0.13
## 2       2    6          0.40
## 3       3 1036          0.75

mean(sil_kmeans[,3])
## [1] 0.7200377
sil_hc <- silhouette(hc_clusters, dist_matrix)
fviz_silhouette(sil_hc)
##   cluster size ave.sil.width
## 1       1 1036          0.71
## 2       2   33         -0.07
## 3       3    6          0.40

mean(sil_hc[,3])
## [1] 0.688182
movies_clean %>%
  group_by(cluster_kmeans) %>%
  summarise(across(where(is.numeric),
                   list(media = mean,
                        mediana = median),
                   na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(...)`.
## ℹ In group 1: `cluster_kmeans = 1`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 3 × 29
##   cluster_kmeans popularity_media popularity_mediana budget_media budget_mediana
##   <fct>                     <dbl>              <dbl>        <dbl>          <dbl>
## 1 1                        21.4                17.0        34685.              0
## 2 2                        37.2                23.9          100               0
## 3 3                         0.184               0.04         194.              0
## # ℹ 24 more variables: revenue_media <dbl>, revenue_mediana <dbl>,
## #   runtime_media <dbl>, runtime_mediana <dbl>, genresAmount_media <dbl>,
## #   genresAmount_mediana <dbl>, productionCoAmount_media <dbl>,
## #   productionCoAmount_mediana <dbl>, productionCountriesAmount_media <dbl>,
## #   productionCountriesAmount_mediana <dbl>, voteCount_media <dbl>,
## #   voteCount_mediana <dbl>, voteAvg_media <dbl>, voteAvg_mediana <dbl>,
## #   actorsPopularity_media <dbl>, actorsPopularity_mediana <dbl>, …
table(movies_clean$cluster_kmeans)
## 
##    1    2    3 
##   33    6 1036

Analisis de Componentes Principales PCA

¿Se pueden incluir variables categoricas?

Las variables categóricas como:

-originalLanguage

-genres

-productionCountry

-director

-productionCompany

No pueden incluirse directamente en PCA porque:

-PCA trabaja con matriz de covarianza/correlación

-Requiere variables numéricas continuas

-Transformarlas con One-Hot Encoding generaría cientos de variables

-Alta cardinalidad → distorsiona la varianza

¿Es conveniente aplicar PCA?

Trabajando solo con variables numericas:

## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(everything(), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix)
## Overall MSA =  0.66
## MSA for each item = 
##                popularity                    budget                   revenue 
##                      0.80                      0.63                      0.79 
##                   runtime              genresAmount        productionCoAmount 
##                      0.77                      0.69                      0.66 
## productionCountriesAmount                 voteCount                   voteAvg 
##                      0.67                      0.70                      0.84 
##          actorsPopularity              actorsAmount           castWomenAmount 
##                      0.74                      0.57                      0.75 
##             castMenAmount 
##                      0.56
## $chisq
## [1] 4630.2
## 
## $p.value
## [1] 0
## 
## $df
## [1] 78

La matriz de correlación evidencia dependencias lineales moderadas y fuertes entre varias variables, lo que justifica la aplicación de PCA para reducir dimensionalidad. El índice KMO global de 0.66 indica que la estructura de correlaciones es aceptable para aplicar análisis factorial o PCA. El test de esfericidad de Bartlett resultó altamente significativo (p < 0.001), lo que confirma que la matriz de correlación no es identidad y que existen correlaciones suficientes para aplicar PCA. Se seleccionaron los primeros 4 componentes principales, ya que presentan valores propios mayores a 1 y explican aproximadamente el 55% de la variabilidad total.

Aplicacion del PCA

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7148 1.3175 1.1751 1.04644 0.97640 0.95403 0.93904
## Proportion of Variance 0.2262 0.1335 0.1062 0.08423 0.07334 0.07001 0.06783
## Cumulative Proportion  0.2262 0.3597 0.4659 0.55017 0.62350 0.69352 0.76135
##                            PC8     PC9    PC10    PC11   PC12    PC13
## Standard deviation     0.89463 0.87342 0.75619 0.70635 0.6629 0.17050
## Proportion of Variance 0.06157 0.05868 0.04399 0.03838 0.0338 0.00224
## Cumulative Proportion  0.82291 0.88160 0.92558 0.96396 0.9978 1.00000
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

##                                   PC1           PC2        PC3         PC4
## popularity                 0.43542868 -0.0005943905  0.1126876  0.17617442
## budget                     0.02997437 -0.1602710629 -0.3761440 -0.28149110
## revenue                    0.08359436 -0.1257894030  0.2979818 -0.27104635
## runtime                    0.27399198 -0.0377905466 -0.1459826 -0.41236679
## genresAmount               0.08715012 -0.2679096337 -0.1316260  0.45333878
## productionCoAmount         0.19825057 -0.3941890154 -0.4130260 -0.06164206
## productionCountriesAmount  0.18355760 -0.3420069390 -0.4116726 -0.04580917
## voteCount                  0.30696326 -0.3085546127  0.3603822  0.17345427
## voteAvg                    0.28113008 -0.1768463787  0.1086102  0.30071303
## actorsPopularity           0.31813996 -0.1917610680  0.3810032 -0.18535925
## actorsAmount              -0.41269398 -0.4743279116  0.1756529 -0.07105981
## castWomenAmount            0.15909004  0.0038582521  0.1772479 -0.52135099
## castMenAmount              0.42041931  0.4740017764 -0.1709301  0.05540265
##                                    PC5          PC6         PC7         PC8
## popularity                 0.157559615 -0.127312615 -0.01076350 -0.28624741
## budget                     0.480312399 -0.098090786 -0.67947694  0.14593230
## revenue                    0.526069546  0.665458564  0.27903130  0.12206156
## runtime                   -0.240979426 -0.145207262  0.23600335  0.59991198
## genresAmount              -0.291973025  0.546011173 -0.34514584  0.12045604
## productionCoAmount        -0.050127656  0.022183150  0.14772374 -0.11924923
## productionCountriesAmount  0.005961045  0.007171287  0.38237936 -0.35839839
## voteCount                  0.143992389 -0.197268630 -0.09558008 -0.11552709
## voteAvg                   -0.111308387 -0.024802732  0.01813402  0.50480040
## actorsPopularity          -0.025557041 -0.254596533 -0.13662058 -0.02947734
## actorsAmount              -0.068741713 -0.113075913  0.01982208  0.02852342
## castWomenAmount           -0.529146321  0.287800197 -0.29746286 -0.30682904
## castMenAmount              0.057968715  0.107764224 -0.02169366 -0.02822801
##                                   PC9          PC10        PC11         PC12
## popularity                 0.04474196 -0.0976154616  0.21432457  0.764664825
## budget                     0.08016029  0.1430284962  0.03374290  0.036938109
## revenue                    0.02148098 -0.0300767463 -0.02465417  0.024738445
## runtime                   -0.35090728  0.0100551672  0.30210473  0.150024941
## genresAmount              -0.41315762  0.0335523672 -0.02401737  0.091462226
## productionCoAmount         0.18498095 -0.7251319870 -0.11276150 -0.138164226
## productionCountriesAmount -0.03443635  0.6259899218 -0.08154153 -0.057341237
## voteCount                 -0.12167118 -0.0003315423  0.53472861 -0.517081809
## voteAvg                    0.67180256  0.2098499612 -0.15445571 -0.006339279
## actorsPopularity          -0.31176774 -0.0031653165 -0.70670043 -0.029053038
## actorsAmount               0.02293111  0.0054241540  0.07151350  0.244788720
## castWomenAmount            0.31124569  0.0776611299  0.15141348 -0.009840161
## castMenAmount             -0.03392206 -0.0097469870 -0.06207499 -0.178283172
##                                   PC13
## popularity                -0.037425563
## budget                    -0.000603145
## revenue                    0.001447045
## runtime                   -0.019570362
## genresAmount              -0.002987639
## productionCoAmount         0.005923534
## productionCountriesAmount  0.005332753
## voteCount                  0.019048033
## voteAvg                    0.007568777
## actorsPopularity          -0.002785658
## actorsAmount               0.696049007
## castWomenAmount           -0.008413845
## castMenAmount              0.716350844
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

El análisis de componentes principales permitió identificar cuatro dimensiones fundamentales en el desempeño de las películas: impacto comercial, estructura del elenco, complejidad productiva e inversión económica.

Esta reducción facilita la construcción futura de modelos predictivos, disminuye problemas de colinealidad y permite segmentar películas en función de características estructurales más compactas.